Over the past decade, with the growing influence of social media and the heightened popularity of global stars, interest in women’s sports has skyrocketed. However, even with high-profile campaigns for equity in sports [1], it has been difficult to cultivate an audience and build a market for women’s sports when they receive minimal media coverage compared to men’s sports [2]. Although major events, such as the WNBA Finals, are strong pulls for sports enthusiasts, the lack of nationally televised games and the limited marketing budgets for showcasing players throughout the year have been barriers that keep sports fans, especially established NBA fans, from consistently engaging with the WNBA even when they are interested. Furthermore, although statistics can fuel sports passion and storytelling, data and advanced statistics for the WNBA only recently became easily accessible to the public [3]. Therefore, we seek not only to provide more convenient and accessible information on WNBA players, but also to promote sustained fan engagement with the league.
Our project aims to make the following contributions, which will be displayed in a public-facing Shiny app:
- Develop archetypes of current WNBA players based on each player’s abilities and overall result/performance-based statistics
- Conduct archetype exploration on NBA players using the same variables to discover similarities and differences in the type of players between the respective leagues
- Draw comparisons between WNBA players and NBA players
- Each WNBA player who has played significant minutes/games will be matched with 3-5 similar NBA players based solely on their tendencies/playstyle
Ultimately, we believe that labeling each WNBA player with an archetype and developing an NBA player comparison can boost year-round engagement, bring the WNBA into the spotlight, and keep it there for years to come.
To define player archetypes in the WNBA and compare WNBA players to NBA players, data must be gathered on a seasonal basis. Using Basketball Reference [4], player statistics were gathered dating back to 2018 (the first year WNBA play-by-play and shot-location data became available). Relevant variables included:
This data allows both playstyle and effectiveness to be evaluated and considered when developing player archetypes and subsequently creating player comparisons between WNBA players and NBA players.
Cleaning the WNBA all stats dataset:
Cleaning the NBA all stats dataset:
The following visualizations and remaining analyses are based on these subsets of players.
Before modeling or clustering, the distributions of variables within the WNBA dataset were examined to better understand the relationships that are present:
Position Distribution:
Distribution of Minutes Per Game across WNBA Players:
Shot distance:
Field Goals:
Assist & Block Percentage:
To choose which subset of variables was important in determining the archetypes, principal component analysis (PCA) was used to reduce the dimensionality of the feature space.
To allow for some uncertainty in the clustering results, a Gaussian Mixture Model (GMM) was used to yield soft assignments for clustering the players.
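To illustrate what a soft assignment means, the following is a minimal sketch in plain Python, not the code used for this project. It assumes a mixture with already-known parameters over a single standardized feature; the weights, means, and standard deviations are hypothetical, and a fitted model would use many variables and estimated parameters.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def soft_assignments(x, weights, means, stds):
    """Return the probability that x belongs to each mixture component."""
    weighted = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(weighted)
    return [v / total for v in weighted]

# Hypothetical two-component mixture (illustrative values only)
weights = [0.5, 0.5]
means = [-1.0, 1.0]
stds = [1.0, 1.0]

# A player halfway between the two components is split evenly,
# while a player far to one side is assigned almost entirely to one cluster.
borderline = soft_assignments(0.0, weights, means, stds)
clear_cut = soft_assignments(3.0, weights, means, stds)
```

A hard-assignment method would force the borderline player into one cluster; the probabilities keep that ambiguity visible.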
Before running a model to derive playstyle comparisons, variables related to playstyle (and not results) were selected. These included:
To develop a model that outputs an NBA comparison for a WNBA player’s playstyle, a Gaussian Mixture Model (GMM) was trained on the past 5 seasons of NBA data (2018-2022). This produced clusters of NBA players, along with the probability of each player belonging to each cluster. WNBA player profiles consisting of the same variables were then fed into the model, which likewise returned their probabilities of belonging to each cluster. To find the NBA player most similar to a WNBA player, the Euclidean distance between the WNBA player’s cluster probabilities and each NBA player’s cluster probabilities was calculated. The NBA player with the smallest distance was selected as the comparison for the WNBA player of interest. A GMM was chosen over K-Means clustering because it yields soft assignments, and the comparison relies on those probabilities.
In other words, each player’s vector of cluster probabilities acts as that player’s profile. Comparing two players means computing the Euclidean distance between their probability vectors, and the ‘closest’ observation is the one with the minimum distance. For any WNBA player, the model therefore returns the ‘closest’ NBA player, who is declared the player comparison.
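The matching step described above can be sketched as follows. This is an illustrative plain-Python version under stated assumptions, not the project’s code: the function names, the placeholder player names, and the three-cluster probability vectors are all hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length probability vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def closest_comps(target_probs, candidate_probs, k=3):
    """Return the k candidates nearest to the target as (name, distance) pairs."""
    distances = [
        (name, euclidean(target_probs, probs))
        for name, probs in candidate_probs.items()
    ]
    distances.sort(key=lambda pair: pair[1])  # smallest distance first
    return distances[:k]

# Hypothetical cluster probabilities for three NBA players
nba = {
    "NBA Player A": [0.90, 0.05, 0.05],
    "NBA Player B": [0.10, 0.80, 0.10],
    "NBA Player C": [0.85, 0.10, 0.05],
}

# Hypothetical cluster probabilities for one WNBA player
wnba_player = [0.88, 0.07, 0.05]

comps = closest_comps(wnba_player, nba, k=2)
```

Returning the top `k` matches rather than only the minimum fits the goal of listing several comparisons per player.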
After applying a Gaussian Mixture Model to the subset of variables informed by PCA, 5 clusters were returned for both the WNBA and the NBA. By combining basketball knowledge with a careful comparison of the cluster averages on all the performance-based variables in our datasets, simple archetype labels were placed on the clusters in each league. Our results indicate that the archetypes in the two leagues are nearly identical: both the WNBA and the NBA have reserves, traditional bigs, facilitators/shooters, and primary scorers/initiators as 4 of their 5 clusters. The only divergence is the fifth cluster, which consists of shooting threats in the WNBA and role players in the NBA.
The full descriptions of the archetypes are displayed below:
*Player examples are from the 2021 season
The following visualizations helped distinguish clusters and inform the archetype labeling:
Points & 3 point attempts:
Rebounds & Assists:
Usage Percentage & Player Efficiency Rating (PER):
Usage % = an estimate of the percentage of team plays used by a player when they were on the floor
PER = a measure of per-minute production standardized such that the league average is 15
Offensive & Defensive Win Share (OWS/DWS):
OWS = an estimate of the number of wins contributed by a player due to offense
DWS = an estimate of the number of wins contributed by a player due to defense
Model uncertainty:
Like any other model, the classification of observations into clusters involves uncertainty. In the Gaussian Mixture Model that we used, uncertainty is defined as \(1 - \max_i p_i\), where \(p_i\) is the probability of a player being assigned to cluster \(i\), for each of the 5 clusters. The plots below indicate the 3 players in each cluster who had the highest cluster assignment uncertainty in the 2021 season.
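As a small worked example of this definition, the sketch below computes the uncertainty for a few players and ranks them. It is plain Python written for illustration; the player names and 5-cluster probabilities are hypothetical, not values from our model.

```python
def assignment_uncertainty(probs):
    """Uncertainty of a soft assignment: 1 minus the largest cluster probability."""
    return 1.0 - max(probs)

def most_uncertain(players, n=3):
    """Return the n players with the highest uncertainty as (name, uncertainty) pairs."""
    ranked = sorted(
        players.items(),
        key=lambda item: assignment_uncertainty(item[1]),
        reverse=True,
    )
    return [(name, assignment_uncertainty(probs)) for name, probs in ranked[:n]]

# Hypothetical probabilities over 5 clusters
players = {
    "Player A": [0.96, 0.01, 0.01, 0.01, 0.01],  # confidently assigned
    "Player B": [0.40, 0.35, 0.15, 0.05, 0.05],  # split between two clusters
    "Player C": [0.70, 0.20, 0.05, 0.03, 0.02],
}

top_two = most_uncertain(players, n=2)
```

A value near 0 means the player sits firmly in one cluster, while larger values flag players who straddle archetypes.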
The Gaussian Mixture Model fit to NBA players on 8 playstyle variables produced 7 clusters. The Adjusted Rand Index between these clusters and the performance-based NBA clusters was 0.248. As part of the comparison process, the uncertainty for each player was also calculated. As previously mentioned, uncertainty is defined as \(1 - \max_i p_i\), where \(p_i\) is, in this case, the probability of a player being assigned to cluster \(i\), for each of the 7 clusters. The 3 WNBA and NBA players with the highest uncertainty for each cluster are shown below (2021 season only).
Table of comparisons for 3 example players:
Sample of WNBA to NBA Comparisons

| WNBA Player | NBA Comp #1 | Distance 1 | NBA Comp #2 | Distance 2 | NBA Comp #3 | Distance 3 | NBA Comp #4 | Distance 4 |
|---|---|---|---|---|---|---|---|---|
| Kelsey Plum | Fred VanVleet | 0.005 | Anfernee Simons | 0.029 | Brandon Williams | 0.044 | Bones Hyland | 0.056 |
| Allie Quigley | Payton Pritchard | 0.068 | Kira Lewis Jr. | 0.220 | Jaylen Nowell | 0.231 | Davion Mitchell | 0.250 |
| Breanna Stewart | Karl-Anthony Towns | 0.008 | Josh Giddey | 0.015 | Brandon Ingram | 0.017 | Anthony Davis | 0.017 |
Initial Shiny app examples for the 3 players in the table above:
The sample player comparisons listed in the table above do seem to pass the ‘eye test’. For example, Allie Quigley is known as a sharpshooter, very similar to her ‘closest’ NBA comparison, Payton Pritchard.
The 10 archetypes defined using results-based statistics suggest that the NBA and the WNBA have nearly identical archetypes.
The WNBA and NBA clustering with performance-based variables produced 5 clusters each, while the NBA clustering with playstyle variables produced 7 clusters.
While it is difficult to evaluate the results of unsupervised learning methods such as GMMs, the Adjusted Rand Index (ARI) can compare two classifications of the same observations (such as the NBA performance-based and NBA playstyle-based clusterings). The ARI has an expected value of 0 for random partitions and equals 1 for identical partitions. The ARI of 0.248 between these two classifications therefore suggests that they agree more than chance would produce, but that they are far from identical.
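For readers unfamiliar with the ARI, here is a minimal sketch of the standard pair-counting formula in plain Python. It is an illustration only, not the routine used to obtain the 0.248 figure, and the label vectors are hypothetical toy data.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same observations."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two observations")

    # Pairs of observations that share a cluster in both labelings
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs that share a cluster within each labeling separately
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (sum_a + sum_b)      # agreement if the partitions matched
    if max_index == expected:
        return 1.0  # degenerate case, e.g. every observation in one cluster
    return (sum_cells - expected) / (max_index - expected)

# Same partition with renamed clusters: the ARI ignores label names
same = adjusted_rand_index([0, 0, 1, 1, 2, 2], [1, 1, 2, 2, 0, 0])

# Partitions that cut across each other: the ARI falls below zero
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index depends only on which observations are grouped together, it can compare clusterings with different numbers of clusters, such as 5 and 7.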
MORE TO BE ADDED LATER
We would first like to express our gratitude to Carnegie Mellon’s Statistics & Data Science Department for giving us the opportunity to complete a project on sports analytics. In particular, this work would not have been possible without the guidance and support of Dr. Ron Yurko, the lead instructor and director of CMSAC, and Maksim Horowitz, senior data analyst for the Atlanta Hawks, who advised our project. We are also grateful to all of those with whom we have had the pleasure of working during this and other related projects, including our fellow students and teaching assistants.
[1] https://www.teamheroine.com/blog/the-10-best-womens-sport-campaigns-of-2020
[2] https://www.si.com/sports-illustrated/2021/03/24/womens-sports-gender-study-discrepancy
[3] https://niemanreports.org/articles/covering-womens-sports/
[4] https://www.basketball-reference.com/wnba/years/2022_per_game.html
Carnegie Mellon University, amorai@cmu.edu
Harvard University, mhombergbertley@college.harvard.edu
St. Olaf College, noecke2@stolaf.edu